ABSTRACT

Personalized web services strive to adapt their services (advertise- 
ments, news articles, etc.) to individual users by making use of 
both content and user information. Despite a few recent advances, 
this problem remains challenging for at least two reasons. First,
web services feature dynamically changing pools of content,
rendering traditional collaborative filtering methods inapplicable.
Second, the scale of most web services of practical interest
calls for solutions that are fast in both learning and computation.

In this work, we model personalized recommendation of news 
articles as a contextual bandit problem, a principled approach in 
which a learning algorithm sequentially selects articles to serve 
users based on contextual information about the users and articles, 
while simultaneously adapting its article-selection strategy based 
on user-click feedback to maximize total user clicks. 

The contributions of this work are three-fold. First, we propose 
a new, general contextual bandit algorithm that is computationally 
efficient and well motivated from learning theory. Second, we ar- 
gue that any bandit algorithm can be reliably evaluated offline us- 
ing previously recorded random traffic. Finally, using this offline 
evaluation method, we successfully applied our new algorithm to 
a Yahoo! Front Page Today Module dataset containing over 33 
million events. Results showed a 12.5% click lift compared to a 
standard context-free bandit algorithm, and the advantage becomes 
even greater when data gets more scarce. 
1. INTRODUCTION

This paper addresses the challenge of identifying the most appro- 
priate web-based content at the best time for individual users. Most 
service vendors acquire and maintain a large amount of content in
their repository, for instance, for filtering news articles [14] or for
the display of advertisements [5]. Moreover, the content of such a
web-service repository changes dynamically, undergoing frequent 
insertions and deletions. In such a setting, it is crucial to quickly 
identify interesting content for users. For instance, a news filter 
must promptly identify the popularity of breaking news, while also 
adapting to the fading value of existing, aging news stories. 

It is generally difficult to model popularity and temporal changes 
based solely on content information. In practice, we usually ex- 
plore the unknown by collecting consumers' feedback in real time 
to evaluate the popularity of new content while monitoring changes 
in its value [3]. For instance, a small amount of traffic can be
designated for such exploration. Based on the users' response (such
as clicks) to randomly selected content on this small slice of traf- 
fic, the most popular content can be identified and exploited on the 
remaining traffic. This strategy, with random exploration on an ε
fraction of the traffic and greedy exploitation on the rest, is known
as ε-greedy. Advanced exploration approaches such as EXP3 [8]
or UCB1 [7] could be applied as well. Intuitively, we need to
distribute more traffic to new content to learn its value more quickly,
and less traffic to track temporal changes of existing content.

Recently, personalized recommendation has become a desirable 
feature for websites to improve user satisfaction by tailoring
content presentation to suit individual users' needs [10].
Personalization involves a process of gathering and storing user attributes,
managing content assets, and, based on an analysis of current and 
past users' behavior, delivering the individually best content to the 
present user being served. 

Often, both users and content are represented by sets of fea- 
tures. User features may include historical activities at an aggre- 
gated level as well as declared demographic information. Content 
features may contain descriptive information and categories. In this 
scenario, exploration and exploitation have to be deployed at an in- 
dividual level since the views of different users on the same con- 
tent can vary significantly. Since there may be a very large number 
of possible choices or actions available, it becomes critical to rec- 
ognize commonalities between content items and to transfer that 
knowledge across the content pool. 

Traditional recommender systems, including collaborative fil- 
tering, content-based filtering and hybrid approaches, can provide 
meaningful recommendations at an individual level by leveraging 
users' interests as demonstrated by their past activity. Collaborative 
filtering [25], by recognizing similarities across users based on their 
consumption history, provides a good recommendation solution to 
the scenarios where overlap in historical consumption across users 
is relatively high and the content universe is almost static.
Content-based filtering helps to identify new items that match an
existing user's consumption profile well, but the recommended items
are always similar to the items previously taken by the user [20].
Hybrid approaches [11] have been developed by combining two
or more recommendation techniques; for example, the inability of
collaborative filtering to recommend new items is commonly
alleviated by combining it with content-based filtering.

However, as noted above, in many web-based scenarios, the content
universe undergoes frequent changes, with content popularity
changing over time as well. Furthermore, a significant number
of visitors are likely to be entirely new with no historical
consumption record whatsoever; this is known as a cold-start
situation [21]. These issues make traditional recommender-system
approaches difficult to apply, as shown by prior empirical studies [12].
It thus becomes indispensable to learn the goodness of match be- 
tween user interests and content when one or both of them are new. 
However, acquiring such information can be expensive and may 
reduce user satisfaction in the short term, raising the question of 
optimally balancing the two competing goals: maximizing user sat- 
isfaction in the long run, and gathering information about goodness 
of match between user interests and content. 

The above problem is indeed known as a feature-based explo- 
ration/exploitation problem. In this paper, we formulate it as a con- 
textual bandit problem, a principled approach in which a learning 
algorithm sequentially selects articles to serve users based on con- 
textual information of the user and articles, while simultaneously 
adapting its article-selection strategy based on user-click feedback 
to maximize total user clicks in the long run. We define a bandit 
problem and then review some existing approaches in Section 2.
Then, we propose a new algorithm, LinUCB, in Section 3, which
has a similar regret analysis to the best known algorithms for
competing with the best linear predictor, with a lower computational
overhead. We also address the problem of offline evaluation in
Section 4, showing this is possible for any explore/exploit
strategy when interactions are independent and identically distributed
(i.i.d.), as might be a reasonable assumption for different users. We
then test our new algorithm and several existing algorithms using
this offline evaluation strategy in Section 5.

2. FORMULATION & RELATED WORK 

In this section, we define the K-armed contextual bandit problem
formally, and as an example, show how it can model the
personalized news article recommendation problem. We then discuss
existing methods and their limitations.

2.1 A Multi-armed Bandit Formulation 

The problem of personalized news article recommendation can 
be naturally modeled as a multi-armed bandit problem with context 
information. Following previous work [18], we call it a contextual
bandit.¹ Formally, a contextual-bandit algorithm A proceeds in
discrete trials $t = 1, 2, 3, \ldots$ In trial t:

1. The algorithm observes the current user $u_t$ and a set $\mathcal{A}_t$ of
arms or actions together with their feature vectors $\mathbf{x}_{t,a}$ for
$a \in \mathcal{A}_t$. The vector $\mathbf{x}_{t,a}$ summarizes information of both the
user $u_t$ and arm $a$, and will be referred to as the context.

2. Based on observed payoffs in previous trials, A chooses an
arm $a_t \in \mathcal{A}_t$, and receives payoff $r_{t,a_t}$ whose expectation
depends on both the user $u_t$ and the arm $a_t$.

3. The algorithm then improves its arm-selection strategy with
the new observation, $(\mathbf{x}_{t,a_t}, a_t, r_{t,a_t})$. It is important to
emphasize here that no feedback (namely, the payoff $r_{t,a}$) is
observed for unchosen arms $a \neq a_t$. The consequence of
this fact is discussed in more detail in the next subsection.

In the process above, the total T-trial payoff of A is defined as
$\sum_{t=1}^{T} r_{t,a_t}$. Similarly, we define the optimal expected T-trial
payoff as $\mathbf{E}\left[\sum_{t=1}^{T} r_{t,a_t^*}\right]$, where $a_t^*$ is the arm with maximum
expected payoff at trial t. Our goal is to design A so that the expected
total payoff above is maximized. Equivalently, we may find an
algorithm so that its regret with respect to the optimal arm-selection
strategy is minimized. Here, the T-trial regret $R_A(T)$ of algorithm
A is defined formally by

$$R_A(T) \stackrel{\text{def}}{=} \mathbf{E}\left[\sum_{t=1}^{T} r_{t,a_t^*}\right] - \mathbf{E}\left[\sum_{t=1}^{T} r_{t,a_t}\right]. \qquad (1)$$

¹ In the literature, contextual bandits are sometimes called bandits
with covariate, bandits with side information, associative bandits,
and associative reinforcement learning.



An important special case of the general contextual bandit problem
is the well-known K-armed bandit in which (i) the arm set $\mathcal{A}_t$
remains unchanged and contains K arms for all t, and (ii) the user
$u_t$ (or equivalently, the context $(\mathbf{x}_{t,1}, \ldots, \mathbf{x}_{t,K})$) is the same for
all t. Since both the arm set and contexts are constant at every trial,
they make no difference to a bandit algorithm, and so we will also
refer to this type of bandit as a context-free bandit.

In the context of article recommendation, we may view articles 
in the pool as arms. When a presented article is clicked, a payoff 
of 1 is incurred; otherwise, the payoff is 0. With this definition 
of payoff, the expected payoff of an article is precisely its click- 
through rate (CTR), and choosing an article with maximum CTR 
is equivalent to maximizing the expected number of clicks from 
users, which in turn is the same as maximizing the total expected 
payoff in our bandit formulation. 
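
To make the click-payoff formulation concrete, the following minimal sketch runs the bandit protocol on a context-free article pool with Bernoulli click payoffs and accumulates the per-trial gap between the best CTR and the CTR of the displayed article, whose sum has the T-trial regret defined above as its expectation. The CTR values, the function names `simulate_regret` and `uniform_policy`, and the uniformly random policy are illustrative assumptions, not part of the original work.

```python
import random

def simulate_regret(ctrs, policy, num_trials, seed=0):
    """Run the bandit protocol; return (total_payoff, accumulated_regret).

    ctrs: hypothetical true click-through rates, one per article (arm).
    policy: callable(rng, num_arms) -> index of the article to display.
    """
    rng = random.Random(seed)
    best_ctr = max(ctrs)              # expected payoff of the optimal arm
    total_payoff = 0
    accumulated_regret = 0.0
    for _ in range(num_trials):
        arm = policy(rng, len(ctrs))
        # Payoff is 1 on a click and 0 otherwise, so its expectation is the CTR.
        payoff = 1 if rng.random() < ctrs[arm] else 0
        total_payoff += payoff
        # Expected loss of this trial relative to always showing the best article.
        accumulated_regret += best_ctr - ctrs[arm]
    return total_payoff, accumulated_regret

def uniform_policy(rng, num_arms):
    """Pure exploration: display an article chosen uniformly at random."""
    return rng.randrange(num_arms)

ctrs = [0.02, 0.05, 0.03]             # assumed CTRs, for illustration only
payoff, regret = simulate_regret(ctrs, uniform_policy, num_trials=10_000)
```

A policy that always displays the article with the highest CTR accumulates zero regret in this sketch, while the uniform policy pays the average CTR gap on every trial.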

Furthermore, in web services we often have access to user infor- 
mation which can be used to infer a user's interest and to choose 
news articles that are probably most interesting to her. For example, 
it is much more likely for a male teenager to be interested in an arti- 
cle about iPod products rather than retirement plans. Therefore, we 
may "summarize" users and articles by a set of informative features 
that describe them compactly. By doing so, a bandit algorithm can 
generalize CTR information from one article/user to another, and 
learn to choose good articles more quickly, especially for new users 
and articles. 

2.2 Existing Bandit Algorithms 

The fundamental challenge in bandit problems is the need for 
balancing exploration and exploitation. To minimize the regret in 
Eq. (1), an algorithm A exploits its past experience to select the arm
that appears best. On the other hand, this seemingly optimal arm 
may in fact be suboptimal, due to imprecision in A's knowledge. In 
order to avoid this undesired situation, A has to explore by actually 
choosing seemingly suboptimal arms so as to gather more information
about them (cf. step 3 in the bandit process defined in the previous
subsection). Exploration can increase short-term regret since
some suboptimal arms may be chosen. However, obtaining infor- 
mation about the arms' average payoffs (i.e., exploration) can re- 
fine A's estimate of the arms' payoffs and in turn reduce long-term 
regret. Clearly, neither a purely exploring nor a purely exploiting 
algorithm works best in general, and a good tradeoff is needed. 

The context-free K-armed bandit problem has been studied by
statisticians for a long time [9, 24, 26]. One of the simplest and
most straightforward algorithms is ε-greedy. In each trial t, this
algorithm first estimates the average payoff $\hat{\mu}_{t,a}$ of each arm a.
Then, with probability $1 - \epsilon$, it chooses the greedy arm (i.e., the
arm with highest payoff estimate); with probability ε, it chooses a
random arm. In the limit, each arm will be tried infinitely often,
and so the payoff estimate $\hat{\mu}_{t,a}$ converges to the true value $\mu_a$ with
probability 1. Furthermore, by decaying ε appropriately (e.g., [24]),
the per-step regret, $R_A(T)/T$, converges to 0 with probability 1.
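
The ε-greedy rule described above can be sketched in a few lines. This is a minimal illustration with a fixed ε (no decay schedule) on simulated Bernoulli payoffs; the function name `epsilon_greedy` and the CTR values are assumptions made for the example.

```python
import random

def epsilon_greedy(ctrs, epsilon, num_trials, seed=0):
    """Fixed-epsilon greedy on a simulated K-armed Bernoulli bandit.

    ctrs: hypothetical true expected payoffs, one per arm.
    Returns (payoff estimates, pull counts, total payoff).
    """
    rng = random.Random(seed)
    num_arms = len(ctrs)
    counts = [0] * num_arms
    means = [0.0] * num_arms          # running average payoff of each arm
    total = 0
    for _ in range(num_trials):
        if rng.random() < epsilon:
            arm = rng.randrange(num_arms)                       # explore
        else:
            arm = max(range(num_arms), key=lambda a: means[a])  # exploit
        payoff = 1 if rng.random() < ctrs[arm] else 0
        counts[arm] += 1
        means[arm] += (payoff - means[arm]) / counts[arm]       # incremental mean
        total += payoff
    return means, counts, total

means, counts, total = epsilon_greedy([0.2, 0.8], epsilon=0.1, num_trials=5000)
```

Because ε stays fixed here, a constant fraction of traffic keeps going to random arms; a decaying ε, as mentioned in the text, would be needed for the per-step regret to vanish.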

In contrast to the unguided exploration strategy adopted by
ε-greedy, another class of algorithms generally known as upper
confidence bound algorithms [4, 7, 17] use a smarter way to balance
exploration and exploitation. Specifically, in trial t, these
algorithms estimate both the mean payoff $\hat{\mu}_{t,a}$ of each arm a as well
as a corresponding confidence interval $c_{t,a}$, so that
$|\hat{\mu}_{t,a} - \mu_a| < c_{t,a}$ holds with high probability. They then select the arm that
achieves the highest upper confidence bound (UCB for short):
$a_t = \arg\max_a (\hat{\mu}_{t,a} + c_{t,a})$. With appropriately defined confidence
intervals, it can be shown that such algorithms have a small total
T-trial regret that is only logarithmic in the total number of trials T,
which turns out to be optimal [17].
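
As an illustration of the UCB selection rule, the sketch below uses the interval width $c_{t,a} = \sqrt{2 \ln t / n_a}$ of the UCB1 algorithm, where $n_a$ is the number of times arm a has been played. This particular width is one common choice and is an assumption of the example, as are the function name `ucb1` and the simulated payoffs.

```python
import math
import random

def ucb1(ctrs, num_trials, seed=0):
    """UCB-type arm selection on a simulated K-armed Bernoulli bandit.

    ctrs: hypothetical true expected payoffs, one per arm.
    Returns (payoff estimates, pull counts).
    """
    rng = random.Random(seed)
    num_arms = len(ctrs)
    counts = [0] * num_arms
    means = [0.0] * num_arms
    for t in range(1, num_trials + 1):
        if t <= num_arms:
            arm = t - 1               # play every arm once so all counts are positive
        else:
            # Select the arm with the highest upper confidence bound.
            arm = max(
                range(num_arms),
                key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        payoff = 1 if rng.random() < ctrs[arm] else 0
        counts[arm] += 1
        means[arm] += (payoff - means[arm]) / counts[arm]
    return means, counts

means, counts = ucb1([0.2, 0.8], num_trials=5000)
```

Unlike ε-greedy, the exploration here is guided: an arm is revisited only while its confidence interval is wide enough for it to still look competitive.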

While context-free K-armed bandits are extensively studied and
well understood, the more general contextual bandit problem has
remained challenging. The EXP4 algorithm [8] uses the exponential
weighting technique to achieve an $\tilde{O}(\sqrt{T})$ regret,² but the
computational complexity may be exponential in the number of
features. Another general contextual bandit algorithm is the
epoch-greedy algorithm [18] that is similar to ε-greedy with shrinking
ε. This algorithm is computationally efficient given an oracle
optimizer but has the weaker regret guarantee of $O(T^{2/3})$.

Algorithms with stronger regret guarantees may be designed under
various modeling assumptions about the bandit. Assuming the
expected payoff of an arm is linear in its features, Auer [6]
describes the LinRel algorithm that is essentially a UCB-type
approach and shows that one of its variants has a regret of $\tilde{O}(\sqrt{T})$, a
significant improvement over earlier algorithms [1].

Finally, we note that there exists another class of bandit
algorithms based on Bayes rule, such as Gittins index
methods [15]. With appropriately defined prior distributions, Bayesian
approaches may have good performance. These methods require
extensive offline engineering to obtain good prior models, and are
often computationally prohibitive without coupling with
approximation techniques [2].

3. ALGORITHM 

Given asymptotic optimality and the strong regret bound of UCB 
methods for context-free bandit algorithms, it is tempting to de- 
vise similar algorithms for contextual bandit problems. Given some 
parametric form of payoff function, a number of methods exist to 
estimate from data the confidence interval of the parameters with 
which we can compute a UCB of the estimated arm payoff. Such 
an approach, however, is expensive in general. 

In this work, we show that a confidence interval can be com- 
puted efficiently in closed form when the payoff model is linear, 
and call this algorithm LinUCB. For convenience of exposition, we 
first describe the simpler form for disjoint linear models, and then 
consider the general case of hybrid models in Section 3.2. We note
that LinUCB is a generic contextual bandit algorithm that applies to
applications other than personalized news article recommendation.

3.1 LinUCB with Disjoint Linear Models 

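Using the notation of Section 2.1, we assume the expected payoff
of an arm a is linear in its d-dimensional feature $\mathbf{x}_{t,a}$ with some
unknown coefficient vector $\boldsymbol{\theta}_a^*$; namely, for all t,

$$\mathbf{E}[r_{t,a} \mid \mathbf{x}_{t,a}] = \mathbf{x}_{t,a}^\top \boldsymbol{\theta}_a^*. \qquad (2)$$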

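² Note $\tilde{O}(\cdot)$ is the same as $O(\cdot)$ but suppresses logarithmic factors.

This model is called disjoint since the parameters are not shared
among different arms. Let $\mathbf{D}_a$ be a design matrix of dimension
$m \times d$ at trial t, whose rows correspond to m training inputs (e.g.,
m contexts that are observed previously for article a), and
$\mathbf{c}_a \in \mathbb{R}^m$ be the corresponding response vector (e.g., the corresponding
m click/no-click user feedback). Applying ridge regression to the
training data $(\mathbf{D}_a, \mathbf{c}_a)$ gives an estimate of the coefficients:

$$\hat{\boldsymbol{\theta}}_a = (\mathbf{D}_a^\top \mathbf{D}_a + \mathbf{I}_d)^{-1} \mathbf{D}_a^\top \mathbf{c}_a, \qquad (3)$$

where $\mathbf{I}_d$ is the $d \times d$ identity matrix. When components in $\mathbf{c}_a$ are
independent conditioned on corresponding rows in $\mathbf{D}_a$, it can be
shown [27] that, with probability at least $1 - \delta$,

$$\left| \mathbf{x}_{t,a}^\top \hat{\boldsymbol{\theta}}_a - \mathbf{E}[r_{t,a} \mid \mathbf{x}_{t,a}] \right| \le \alpha \sqrt{\mathbf{x}_{t,a}^\top (\mathbf{D}_a^\top \mathbf{D}_a + \mathbf{I}_d)^{-1} \mathbf{x}_{t,a}} \qquad (4)$$

for any $\delta > 0$ and $\mathbf{x}_{t,a} \in \mathbb{R}^d$, where $\alpha = 1 + \sqrt{\ln(2/\delta)/2}$ is a
constant. In other words, the inequality above gives a reasonably
tight UCB for the expected payoff of arm a, from which a UCB-type
arm-selection strategy can be derived: at each trial t, choose

$$a_t \stackrel{\text{def}}{=} \arg\max_{a \in \mathcal{A}_t} \left( \mathbf{x}_{t,a}^\top \hat{\boldsymbol{\theta}}_a + \alpha \sqrt{\mathbf{x}_{t,a}^\top \mathbf{A}_a^{-1} \mathbf{x}_{t,a}} \right), \qquad (5)$$

where $\mathbf{A}_a \stackrel{\text{def}}{=} \mathbf{D}_a^\top \mathbf{D}_a + \mathbf{I}_d$.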
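
The ridge estimate of Eq. (3) and the upper confidence bound used in Eq. (5) can be computed directly for a single arm, as in the sketch below. It assumes NumPy is available; the function name `ridge_ucb`, the toy design matrix, and the choice δ = 0.05 are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def ridge_ucb(D, c, x, alpha):
    """Return (theta_hat, ucb) for one arm.

    D: (m, d) design matrix of contexts previously observed for the arm.
    c: (m,) observed payoffs (1 for a click, 0 for no click).
    x: (d,) current context of the arm.
    alpha: exploration constant, e.g. 1 + sqrt(ln(2 / delta) / 2).
    """
    d = D.shape[1]
    A = D.T @ D + np.eye(d)                     # A_a = D_a^T D_a + I_d
    theta_hat = np.linalg.solve(A, D.T @ c)     # ridge estimate, Eq. (3)
    width = alpha * np.sqrt(x @ np.linalg.solve(A, x))  # right-hand side of Eq. (4)
    return theta_hat, float(x @ theta_hat + width)

delta = 0.05                                    # assumed confidence parameter
alpha = 1 + np.sqrt(np.log(2 / delta) / 2)
D = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])   # toy contexts
c = np.array([1.0, 0.0, 0.0])                        # toy click feedback
theta_hat, ucb = ridge_ucb(D, c, np.array([1.0, 0.0]), alpha)
```

With no training data (an empty design matrix), the estimate is the zero vector and the bound reduces to α times the norm of the context, which is how a newly added arm is scored.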

The confidence interval in Eq. (4) may be motivated and derived
from other principles. For instance, ridge regression can also be
interpreted as a Bayesian point estimate, where the posterior
distribution of the coefficient vector, denoted as $p(\boldsymbol{\theta}_a)$, is Gaussian
with mean $\hat{\boldsymbol{\theta}}_a$ and covariance $\mathbf{A}_a^{-1}$. Given the current model, the
predictive variance of the expected payoff $\mathbf{x}_{t,a}^\top \boldsymbol{\theta}_a^*$ is evaluated as
$\mathbf{x}_{t,a}^\top \mathbf{A}_a^{-1} \mathbf{x}_{t,a}$, and then $\sqrt{\mathbf{x}_{t,a}^\top \mathbf{A}_a^{-1} \mathbf{x}_{t,a}}$ becomes the standard
deviation. Furthermore, in information theory [19], the differential
entropy of $p(\boldsymbol{\theta}_a)$ is defined as $-\frac{1}{2} \ln((2\pi)^d \det \mathbf{A}_a)$. The entropy
of $p(\boldsymbol{\theta}_a)$ when updated by the inclusion of the new point $\mathbf{x}_{t,a}$ then
becomes $-\frac{1}{2} \ln((2\pi)^d \det(\mathbf{A}_a + \mathbf{x}_{t,a} \mathbf{x}_{t,a}^\top))$. The entropy
reduction in the model posterior is $\frac{1}{2} \ln(1 + \mathbf{x}_{t,a}^\top \mathbf{A}_a^{-1} \mathbf{x}_{t,a})$. This
quantity is often used to evaluate model improvement contributed from
$\mathbf{x}_{t,a}$. Therefore, the criterion for arm selection in Eq. (5) can also
be regarded as an additive trade-off between the payoff estimate
and model uncertainty reduction.
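
The entropy-reduction expression follows in one step from the matrix determinant lemma, $\det(\mathbf{A} + \mathbf{u}\mathbf{v}^\top) = \det(\mathbf{A})(1 + \mathbf{v}^\top \mathbf{A}^{-1} \mathbf{u})$. The short derivation below spells this out using only the two entropy expressions given above; it is added here as a worked step and is not taken from the original text.

```latex
\begin{aligned}
\text{entropy reduction}
  &= -\tfrac{1}{2} \ln\!\big((2\pi)^d \det \mathbf{A}_a\big)
     + \tfrac{1}{2} \ln\!\big((2\pi)^d \det(\mathbf{A}_a + \mathbf{x}_{t,a}\mathbf{x}_{t,a}^\top)\big) \\
  &= \tfrac{1}{2} \ln \frac{\det(\mathbf{A}_a + \mathbf{x}_{t,a}\mathbf{x}_{t,a}^\top)}{\det \mathbf{A}_a} \\
  &= \tfrac{1}{2} \ln \frac{\det(\mathbf{A}_a)\,\big(1 + \mathbf{x}_{t,a}^\top \mathbf{A}_a^{-1} \mathbf{x}_{t,a}\big)}{\det \mathbf{A}_a} \\
  &= \tfrac{1}{2} \ln\!\big(1 + \mathbf{x}_{t,a}^\top \mathbf{A}_a^{-1} \mathbf{x}_{t,a}\big).
\end{aligned}
```

Since the logarithm is increasing, arms with a larger predictive variance $\mathbf{x}_{t,a}^\top \mathbf{A}_a^{-1} \mathbf{x}_{t,a}$ are exactly those whose observation would reduce the posterior entropy the most.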

Algorithm 1 gives a detailed description of the entire LinUCB
algorithm, whose only input parameter is α. Note the value of α
given in Eq. (4) may be conservatively large in some applications,
and so optimizing this parameter may result in higher total payoffs
in practice. Like all UCB methods, LinUCB always chooses the
arm with highest UCB (as in Eq. (5)).

This algorithm has a few nice properties. First, its computational
complexity is linear in the number of arms and at most cubic in
the number of features. To decrease computation further, we may
update $\mathbf{A}_{a_t}$ in every step (which takes $O(d^2)$ time), but compute
and cache $\mathbf{Q}_a \stackrel{\text{def}}{=} \mathbf{A}_a^{-1}$ (for all a) periodically instead of in real
time. Second, the algorithm works well for a dynamic arm set,
and remains efficient as long as the size of $\mathcal{A}_t$ is not too large. This
case is true in many applications. In news article recommendation,
for instance, editors add/remove articles to/from a pool and the pool
size remains essentially constant. Third, although it is not the focus
of the present paper, we can adapt the analysis from [6] to show the
following: if the arm set $\mathcal{A}_t$ is fixed and contains K arms, then the
confidence interval (i.e., the right-hand side of Eq. (4)) decreases
fast enough with more and more data, which yields the strong
regret bound of $\tilde{O}(\sqrt{KdT})$, matching the state-of-the-art result [6]
for bandits satisfying Eq. (2). These theoretical results indicate
fundamental soundness and efficiency of the algorithm.



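Algorithm 1 LinUCB with disjoint linear models.

 0: Inputs: $\alpha \in \mathbb{R}_+$
 1: for $t = 1, 2, 3, \ldots, T$ do
 2:   Observe features of all arms $a \in \mathcal{A}_t$: $\mathbf{x}_{t,a} \in \mathbb{R}^d$
 3:   for all $a \in \mathcal{A}_t$ do
 4:     if a is new then
 5:       $\mathbf{A}_a \leftarrow \mathbf{I}_d$ (d-dimensional identity matrix)
 6:       $\mathbf{b}_a \leftarrow \mathbf{0}_{d \times 1}$ (d-dimensional zero vector)
 7:     end if
 8:     $\hat{\boldsymbol{\theta}}_a \leftarrow \mathbf{A}_a^{-1} \mathbf{b}_a$
 9:     $p_{t,a} \leftarrow \hat{\boldsymbol{\theta}}_a^\top \mathbf{x}_{t,a} + \alpha \sqrt{\mathbf{x}_{t,a}^\top \mathbf{A}_a^{-1} \mathbf{x}_{t,a}}$
10:   end for
11:   Choose arm $a_t = \arg\max_{a \in \mathcal{A}_t} p_{t,a}$ with ties broken
      arbitrarily, and observe a real-valued payoff $r_t$
12:   $\mathbf{A}_{a_t} \leftarrow \mathbf{A}_{a_t} + \mathbf{x}_{t,a_t} \mathbf{x}_{t,a_t}^\top$
13:   $\mathbf{b}_{a_t} \leftarrow \mathbf{b}_{a_t} + r_t \mathbf{x}_{t,a_t}$
14: end for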
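
The pseudocode of LinUCB with disjoint linear models translates almost line by line into NumPy, as sketched below. The class name `DisjointLinUCB`, the `select`/`update` split, and the toy two-arm loop with deterministic payoffs are assumptions made for illustration; the matrix inverse is recomputed on every call for clarity rather than cached as the text suggests.

```python
import numpy as np

class DisjointLinUCB:
    """Sketch of LinUCB with one ridge-regression model per arm."""

    def __init__(self, dim, alpha):
        self.dim = dim
        self.alpha = alpha
        self.A = {}   # arm id -> (d, d) matrix A_a
        self.b = {}   # arm id -> (d,) vector b_a

    def select(self, contexts):
        """contexts: dict mapping arm id -> (d,) feature vector x_{t,a}."""
        best_arm, best_score = None, -np.inf
        for arm, x in contexts.items():
            if arm not in self.A:                    # new arm: lines 4-7
                self.A[arm] = np.eye(self.dim)
                self.b[arm] = np.zeros(self.dim)
            A_inv = np.linalg.inv(self.A[arm])
            theta = A_inv @ self.b[arm]              # line 8
            score = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)  # line 9
            if score > best_score:                   # ties go to the first arm seen
                best_arm, best_score = arm, score
        return best_arm

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)                # line 12
        self.b[arm] += reward * x                    # line 13

# Toy run: both arms share the context [1.0]; arm 1 always pays 1, arm 0 pays 0.
model = DisjointLinUCB(dim=1, alpha=0.5)
x = np.array([1.0])
rewards = {0: 0.0, 1: 1.0}
picks = []
for _ in range(50):
    arm = model.select({0: x, 1: x})
    model.update(arm, x, rewards[arm])
    picks.append(arm)
```

In this toy run the first trial is a tie that goes to arm 0; after one zero payoff its upper confidence bound drops below that of the untried arm 1, which is then selected for the remaining trials.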



Finally, we note that, under the assumption that input features
$\mathbf{x}_{t,a}$ were drawn i.i.d. from a normal distribution (in addition to the
modeling assumption in Eq. (2)), Pavlidis et al. [22] came up with
a similar algorithm that uses a least-squares solution $\tilde{\boldsymbol{\theta}}_a$ instead of
our ridge-regression solution ($\hat{\boldsymbol{\theta}}_a$ in Eq. (3)) to compute the UCB.
However, our approach (and theoretical analysis) is more general
and remains valid even when input features are nonstationary. More
importantly, we will discuss in the next section how to extend the
basic Algorithm 1 to a much more interesting case not covered by
Pavlidis et al.

3.2 LinUCB with Hybrid Linear Models 

Algorithm 1 (or the similar algorithm in [22]) computes the
inverse of the matrix $\mathbf{D}_a^\top \mathbf{D}_a + \mathbf{I}_d$ (or $\mathbf{D}_a^\top \mathbf{D}_a$), where $\mathbf{D}_a$ is again
the design matrix with rows corresponding to features in the
training data. These matrices of all arms have fixed dimension $d \times d$,
and can be updated efficiently and incrementally. Moreover, their
inverses can be computed easily as the parameters in Algorithm 1
are disjoint: the solution $\hat{\boldsymbol{\theta}}_a$ in Eq. (3) is not affected by training
data of other arms, and so can be computed separately. We now
consider the more interesting case with hybrid models.

In many applications including ours, it is helpful to use features
that are shared by all arms, in addition to the arm-specific ones. For
example, in news article recommendation, a user may prefer only
articles about politics, and shared features provide a mechanism for
capturing such a preference. Hence, it is helpful to have features
that have both shared and non-shared components. Formally, we adopt
the following hybrid model by adding another linear term to the
right-hand side of Eq. (2):

$$\mathbf{E}[r_{t,a} \mid \mathbf{x}_{t,a}] = \mathbf{z}_{t,a}^\top \boldsymbol{\beta}^* + \mathbf{x}_{t,a}^\top \boldsymbol{\theta}_a^*, \qquad (6)$$

where $\mathbf{z}_{t,a} \in \mathbb{R}^k$ is the feature of the current user/article
combination, and $\boldsymbol{\beta}^*$ is an unknown coefficient vector common to all arms.
This model is hybrid in the sense that some of the coefficients $\boldsymbol{\beta}^*$
are shared by all arms, while others $\boldsymbol{\theta}_a^*$ are not.

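For hybrid models, we can no longer use Algorithm 1 as the
confidence intervals of various arms are not independent due to the
shared features. Fortunately, there is an efficient way to compute
a UCB along the same line of reasoning as in the previous
section. The derivation relies heavily on block matrix inversion
techniques. Due to space limitation, we only give the pseudocode in
Algorithm 2 (where lines 5 and 12 compute the ridge-regression
solution of the coefficients, and line 13 computes the confidence
interval), and leave detailed derivations to a full paper. Here, we
only point out the important fact that the algorithm is
computationally efficient since the building blocks in the algorithm
($\mathbf{A}_0$, $\mathbf{b}_0$, $\mathbf{A}_a$, $\mathbf{B}_a$, and $\mathbf{b}_a$) all have fixed dimensions and can be updated
incrementally. Furthermore, quantities associated with arms not
existing in $\mathcal{A}_t$ no longer get involved in the computation. Finally,
we can also compute and cache the inverses ($\mathbf{A}_0^{-1}$ and $\mathbf{A}_a^{-1}$)
periodically instead of at the end of each trial to reduce the per-trial
computational complexity to $O(d^2 + k^2)$.

Algorithm 2 LinUCB with hybrid linear models.

 0: Inputs: $\alpha \in \mathbb{R}_+$
 1: $\mathbf{A}_0 \leftarrow \mathbf{I}_k$ (k-dimensional identity matrix)
 2: $\mathbf{b}_0 \leftarrow \mathbf{0}_k$ (k-dimensional zero vector)
 3: for $t = 1, 2, 3, \ldots, T$ do
 4:   Observe features of all arms $a \in \mathcal{A}_t$: $(\mathbf{z}_{t,a}, \mathbf{x}_{t,a}) \in \mathbb{R}^{k+d}$
 5:   $\hat{\boldsymbol{\beta}} \leftarrow \mathbf{A}_0^{-1} \mathbf{b}_0$
 6:   for all $a \in \mathcal{A}_t$ do
 7:     if a is new then
 8:       $\mathbf{A}_a \leftarrow \mathbf{I}_d$ (d-dimensional identity matrix)
 9:       $\mathbf{B}_a \leftarrow \mathbf{0}_{d \times k}$ (d-by-k zero matrix)
10:       $\mathbf{b}_a \leftarrow \mathbf{0}_{d \times 1}$ (d-dimensional zero vector)
11:     end if
12:     $\hat{\boldsymbol{\theta}}_a \leftarrow \mathbf{A}_a^{-1} (\mathbf{b}_a - \mathbf{B}_a \hat{\boldsymbol{\beta}})$
13:     $s_{t,a} \leftarrow \mathbf{z}_{t,a}^\top \mathbf{A}_0^{-1} \mathbf{z}_{t,a} - 2 \mathbf{z}_{t,a}^\top \mathbf{A}_0^{-1} \mathbf{B}_a^\top \mathbf{A}_a^{-1} \mathbf{x}_{t,a} + \mathbf{x}_{t,a}^\top \mathbf{A}_a^{-1} \mathbf{x}_{t,a} + \mathbf{x}_{t,a}^\top \mathbf{A}_a^{-1} \mathbf{B}_a \mathbf{A}_0^{-1} \mathbf{B}_a^\top \mathbf{A}_a^{-1} \mathbf{x}_{t,a}$
14:     $p_{t,a} \leftarrow \mathbf{z}_{t,a}^\top \hat{\boldsymbol{\beta}} + \mathbf{x}_{t,a}^\top \hat{\boldsymbol{\theta}}_a + \alpha \sqrt{s_{t,a}}$
15:   end for
16:   Choose arm $a_t = \arg\max_{a \in \mathcal{A}_t} p_{t,a}$ with ties broken
      arbitrarily, and observe a real-valued payoff $r_t$
17:   $\mathbf{A}_0 \leftarrow \mathbf{A}_0 + \mathbf{B}_{a_t}^\top \mathbf{A}_{a_t}^{-1} \mathbf{B}_{a_t}$
18:   $\mathbf{b}_0 \leftarrow \mathbf{b}_0 + \mathbf{B}_{a_t}^\top \mathbf{A}_{a_t}^{-1} \mathbf{b}_{a_t}$
19:   $\mathbf{A}_{a_t} \leftarrow \mathbf{A}_{a_t} + \mathbf{x}_{t,a_t} \mathbf{x}_{t,a_t}^\top$
20:   $\mathbf{B}_{a_t} \leftarrow \mathbf{B}_{a_t} + \mathbf{x}_{t,a_t} \mathbf{z}_{t,a_t}^\top$
21:   $\mathbf{b}_{a_t} \leftarrow \mathbf{b}_{a_t} + r_t \mathbf{x}_{t,a_t}$
22:   $\mathbf{A}_0 \leftarrow \mathbf{A}_0 + \mathbf{z}_{t,a_t} \mathbf{z}_{t,a_t}^\top - \mathbf{B}_{a_t}^\top \mathbf{A}_{a_t}^{-1} \mathbf{B}_{a_t}$
23:   $\mathbf{b}_0 \leftarrow \mathbf{b}_0 + r_t \mathbf{z}_{t,a_t} - \mathbf{B}_{a_t}^\top \mathbf{A}_{a_t}^{-1} \mathbf{b}_{a_t}$
24: end for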
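
The hybrid variant can be sketched in NumPy in the same way as the disjoint one, following the numbered lines of the pseudocode for LinUCB with hybrid linear models. The class name `HybridLinUCB`, the `select`/`update` split, the guard against a slightly negative $s_{t,a}$ caused by round-off, and the toy scalar example are assumptions of this sketch; inverses are recomputed on every call instead of being cached.

```python
import numpy as np

class HybridLinUCB:
    """Sketch of hybrid LinUCB: shared coefficients beta plus per-arm theta_a."""

    def __init__(self, d, k, alpha):
        self.d, self.k, self.alpha = d, k, alpha
        self.A0 = np.eye(k)                    # line 1
        self.b0 = np.zeros(k)                  # line 2
        self.A, self.B, self.b = {}, {}, {}    # per-arm A_a, B_a, b_a

    def select(self, features):
        """features: dict arm id -> (z, x) with z in R^k and x in R^d."""
        A0_inv = np.linalg.inv(self.A0)
        beta = A0_inv @ self.b0                                  # line 5
        best_arm, best_score = None, -np.inf
        for arm, (z, x) in features.items():
            if arm not in self.A:                                # lines 7-11
                self.A[arm] = np.eye(self.d)
                self.B[arm] = np.zeros((self.d, self.k))
                self.b[arm] = np.zeros(self.d)
            A_inv = np.linalg.inv(self.A[arm])
            B = self.B[arm]
            theta = A_inv @ (self.b[arm] - B @ beta)             # line 12
            Ainv_x = A_inv @ x
            s = (z @ A0_inv @ z
                 - 2 * z @ A0_inv @ B.T @ Ainv_x
                 + x @ Ainv_x
                 + Ainv_x @ B @ A0_inv @ B.T @ Ainv_x)           # line 13
            # max(...) guards against tiny negative values from round-off (an addition).
            score = z @ beta + x @ theta + self.alpha * np.sqrt(max(s, 0.0))  # line 14
            if score > best_score:
                best_arm, best_score = arm, score
        return best_arm

    def update(self, arm, z, x, reward):
        A_inv = np.linalg.inv(self.A[arm])
        B = self.B[arm]
        self.A0 += B.T @ A_inv @ B                               # line 17
        self.b0 += B.T @ A_inv @ self.b[arm]                     # line 18
        self.A[arm] += np.outer(x, x)                            # line 19
        self.B[arm] += np.outer(x, z)                            # line 20
        self.b[arm] += reward * x                                # line 21
        A_inv = np.linalg.inv(self.A[arm])                       # inverse of the updated A_a
        B = self.B[arm]
        self.A0 += np.outer(z, z) - B.T @ A_inv @ B              # line 22
        self.b0 += reward * z - B.T @ A_inv @ self.b[arm]        # line 23

# Toy run with scalar features: arm 1 always pays 1, arm 0 always pays 0.
model = HybridLinUCB(d=1, k=1, alpha=0.5)
z = x = np.array([1.0])
rewards = {0: 0.0, 1: 1.0}
picks = []
for _ in range(50):
    arm = model.select({0: (z, x), 1: (z, x)})
    model.update(arm, z, x, rewards[arm])
    picks.append(arm)
beta_hat = np.linalg.inv(model.A0) @ model.b0
```

The two-stage update of the shared statistics (lines 17-18 before the arm's own update, lines 22-23 after it) removes the arm's old contribution and adds its new one, so the shared estimate stays consistent with a joint ridge regression over all arms.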

4. EVALUATION METHODOLOGY 

Compared to machine learning in the more standard supervised 
setting, evaluation of methods in a contextual bandit setting is frus- 
tratingly difficult. Our goal here is to measure the performance of a 
bandit algorithm π, that is, a rule for selecting an arm at each time
step based on the preceding interactions (such as the algorithms
described above). Because of the interactive nature of the problem, it
would seem that the only way to do this is to actually run the
algorithm on "live" data. However, in practice, this approach is likely to
be infeasible due to the serious logistical challenges that it presents.
Rather, we may only have offline data available that was collected
at a previous time using an entirely different logging policy.
Because payoffs are only observed for the arms chosen by the logging
policy, which are likely to often differ from those chosen by the
algorithm π being evaluated, it is not at all clear how to evaluate
π based only on such logged data. This evaluation problem may
be viewed as a special case of the so-called "off-policy evaluation
problem" in reinforcement learning (see, e.g., [23]).

One solution is to build a simulator to model the bandit process
from the logged data, and then evaluate π with the simulator.
However, the modeling step will introduce bias in the simulator and so
make it hard to justify the reliability of this simulator-based
evaluation approach. In contrast, we propose an approach that is simple
to implement, grounded on logged data, and unbiased.

In this section, we describe a provably reliable technique for car- 
rying out such an evaluation, assuming that the individual events 
are i.i.d., and that the logging policy that was used to gather the 
logged data chose each arm at each time step uniformly at random. 
Although we omit the details, this latter assumption can be weak- 
ened considerably so that any randomized logging policy is allowed 
and our solution can be modified accordingly using rejection sam- 
pling, but at the cost of decreased efficiency in using data. 

More precisely, we suppose that there is some unknown
distribution D from which tuples are drawn i.i.d. of the form
$(\mathbf{x}_1, \ldots, \mathbf{x}_K, r_1, \ldots, r_K)$, each consisting of observed feature
vectors and hidden payoffs for all arms. We also posit access to a large
sequence of logged events resulting from the interaction of the
logging policy with the world. Each such event consists of the context
vectors $\mathbf{x}_1, \ldots, \mathbf{x}_K$, a selected arm $a$ and the resulting observed
payoff $r_a$. Crucially, only the payoff $r_a$ is observed for the single arm
$a$ that was chosen uniformly at random. For simplicity of
presentation, we take this sequence of logged events to be an infinitely long
stream; however, we also give explicit bounds on the actual finite
number of events required by our evaluation method.

Our goal is to use this data to evaluate a bandit algorithm π.
Formally, π is a (possibly randomized) mapping for selecting the
arm $a_t$ at time t based on the history $h_{t-1}$ of $t - 1$ preceding events,
together with the current context vectors $\mathbf{x}_{t1}, \ldots, \mathbf{x}_{tK}$.

Our proposed policy evaluator is shown in Algorithm 3. The
method takes as input a policy π and a desired number of "good"
events T on which to base the evaluation. We then step through
the stream of logged events one by one. If, given the current
history $h_{t-1}$, it happens that the policy π chooses the same arm $a$ as
the one that was selected by the logging policy, then the event is
retained, that is, added to the history, and the total payoff $R_t$ is
updated. Otherwise, if the policy π selects a different arm from the
one that was taken by the logging policy, then the event is entirely
ignored, and the algorithm proceeds to the next event without any
other change in its state.

Note that, because the logging policy chooses each arm uni- 
formly at random, each event is retained by this algorithm with 
probability exactly 1/K, independent of everything else. This 
means that the events which are retained have the same distribution 
as if they were selected by D. As a result, we can prove that two 
processes are equivalent: the first is evaluating the policy against T 
real-world events from D, and the second is evaluating the policy 
using the policy evaluator on a stream of logged events. 

Theorem 1. For all distributions D of contexts, all policies π,
all T, and all sequences of events $h_T$,

$$\Pr_{\text{Policy\_Evaluator}(\pi, S)}(h_T) = \Pr_{\pi, D}(h_T),$$

where S is a stream of events drawn i.i.d. from a uniform random
logging policy and D. Furthermore, the expected number of events
obtained from the stream to gather a history $h_T$ of length T is KT.

This theorem says that every history $h_T$ has the identical
probability in the real world as in the policy evaluator. Many statistics
of these histories, such as the average payoff $R_T/T$ returned by
Algorithm 3, are therefore unbiased estimates of the value of the
algorithm π. Further, the theorem states that KT logged events are
required, in expectation, to retain a sample of size T.

Algorithm 3 Policy_Evaluator.

 0: Inputs: $T > 0$; policy π; stream of events
 1: $h_0 \leftarrow \emptyset$ {An initially empty history}
 2: $R_0 \leftarrow 0$ {An initially zero total payoff}
 3: for $t = 1, 2, 3, \ldots, T$ do
 4:   repeat
 5:     Get next event $(\mathbf{x}_1, \ldots, \mathbf{x}_K, a, r_a)$
 6:   until $\pi(h_{t-1}, (\mathbf{x}_1, \ldots, \mathbf{x}_K)) = a$
 7:   $h_t \leftarrow \text{CONCATENATE}(h_{t-1}, (\mathbf{x}_1, \ldots, \mathbf{x}_K, a, r_a))$
 8:   $R_t \leftarrow R_{t-1} + r_a$
 9: end for
10: Output: $R_T / T$

Proof. The proof is by induction on $t = 1, \ldots, T$ starting with
a base case of the empty history which has probability 1 when $t = 0$
under both methods of evaluation. In the inductive case, assume
that we have for all $t - 1$:

$$\Pr_{\text{Policy\_Evaluator}(\pi, S)}(h_{t-1}) = \Pr_{\pi, D}(h_{t-1})$$

and want to prove the same statement for any history $h_t$. Since the
data is i.i.d. and any randomization in the policy is independent of
randomization in the world, we need only prove that conditioned
on the history $h_{t-1}$ the distribution over the t-th event is the same
for each process. In other words, we must show:

$$\Pr_{\text{Policy\_Evaluator}(\pi, S)}\big((\mathbf{x}_{t,1}, \ldots, \mathbf{x}_{t,K}, a, r_{t,a}) \mid h_{t-1}\big) = \Pr_{D}(\mathbf{x}_{t,1}, \ldots, \mathbf{x}_{t,K}, r_{t,a}) \, \Pr_{\pi(h_{t-1})}(a \mid \mathbf{x}_{t,1}, \ldots, \mathbf{x}_{t,K}).$$

Since the arm $a$ is chosen uniformly at random in the logging
policy, the probability that the policy evaluator exits the inner loop is
identical for any policy, any history, any features, and any arm,
implying this happens for the last event with the probability of the
last event, $\Pr_D(\mathbf{x}_{t,1}, \ldots, \mathbf{x}_{t,K}, r_{t,a})$. Similarly, since the policy π's
distribution over arms is independent conditioned on the history
$h_{t-1}$ and features $(\mathbf{x}_{t,1}, \ldots, \mathbf{x}_{t,K})$, the probability of arm $a$ is just
$\Pr_{\pi(h_{t-1})}(a \mid \mathbf{x}_{t,1}, \ldots, \mathbf{x}_{t,K})$.

Finally, since each event from the stream is retained with
probability exactly 1/K, the expected number required to retain T events
is exactly KT. □
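
The policy evaluator described in this section can be sketched in plain Python as follows. The function names `policy_evaluator` and `uniform_log_stream`, the synthetic log with fixed click-through rates, and the placeholder contexts are assumptions made for illustration; a real log would carry the recorded feature vectors of each event.

```python
import random

def policy_evaluator(policy, stream, num_good_events):
    """Replay logged events and keep only those where the policy agrees.

    policy: callable(history, contexts) -> chosen arm index.
    stream: iterator of logged events (contexts, logged_arm, payoff),
            assumed to come from a uniformly random logging policy.
    Returns (average payoff per retained event, number of events consumed).
    """
    history = []              # h_0: initially empty history
    total_payoff = 0.0        # R_0: initially zero total payoff
    consumed = 0
    for _ in range(num_good_events):
        while True:           # repeat ... until the policy matches the logged arm
            contexts, logged_arm, payoff = next(stream)
            consumed += 1
            if policy(history, contexts) == logged_arm:
                break
        history.append((contexts, logged_arm, payoff))
        total_payoff += payoff
    return total_payoff / num_good_events, consumed

def uniform_log_stream(ctrs, seed=0):
    """Synthetic log: arms are chosen uniformly at random (hypothetical CTRs)."""
    rng = random.Random(seed)
    num_arms = len(ctrs)
    while True:
        contexts = tuple(range(num_arms))   # placeholder for feature vectors
        arm = rng.randrange(num_arms)
        payoff = 1 if rng.random() < ctrs[arm] else 0
        yield contexts, arm, payoff

ctrs = [0.1, 0.6, 0.3]
always_arm_1 = lambda history, contexts: 1
value, consumed = policy_evaluator(always_arm_1, uniform_log_stream(ctrs), 5000)
```

For this fixed policy the estimate should be close to the CTR of arm 1, and the number of consumed events should be close to KT, in line with Theorem 1. With a finite log, `next(stream)` raises `StopIteration` once the events run out, which a caller would need to handle.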



5. EXPERIMENTS 

In this section, we verify the capacity of the proposed LinUCB 
algorithm on a real-world application using the offline evaluation 
method of Section 4. We start with an introduction of the problem
setting in Yahoo! Today-Module, and then describe the user/item 
attributes we used in experiments. Finally, we define performance 
metrics and report experimental results with comparison to a few 
standard (contextual) bandit algorithms. 

5.1 Yahoo! Today Module 

The Today Module is the most prominent panel on the Yahoo! Front
Page, which is also one of the most visited pages on the Internet;
see a snapshot in Figure 1. The default "Featured" tab in the Today
Module highlights one of four high-quality articles, mainly news,
which are selected from an hourly-refreshed article pool curated by
human editors. As illustrated in Figure 1, there are four articles
at footer positions, indexed by F1-F4. Each article is represented
by a small picture and a title. One of the four articles is
highlighted at the story position, which features a large picture,
a title and a short summary along with related links. By default,
the article at F1 is highlighted at the story position. A user can
click on the highlighted article at the story position to read more
details if she is interested in the article. The event is recorded
as a story click. To draw visitors' attention, we would like to rank
available articles according to individual interests, and highlight
the most attractive article for each visitor at the story position.

Figure 1: A snapshot of the "Featured" tab in the Today Module on
Yahoo! Front Page. By default, the article at the F1 position is
highlighted at the story position.

5.2 Experiment Setup 

This subsection gives a detailed description of our experimental
setup, including data collection, feature construction, performance 
evaluation, and competing algorithms. 

5.2.1 Data Collection 

We collected events from a random bucket in May 2009. Users were
randomly assigned to the bucket with a certain probability per
visiting view.* In this bucket, articles were randomly selected from
the article pool to serve users. To avoid exposure bias at footer
positions, we only focused on users' interactions with F1 articles
at the story position. Each user interaction event consists of three
components: (i) the random article chosen to serve the user, (ii)
user/article information, and (iii) whether the user clicks on the
article at the story position. Section 4 shows these random events
can be used to reliably evaluate a bandit algorithm's expected payoff.

There were about 4.7 million events in the random bucket on 
May 01. We used this day's events (called "tuning data") for model 
validation to decide the optimal parameter for each competing ban- 
dit algorithm. Then we ran these algorithms with tuned parameters 
on a one-week event set (called "evaluation data") in the random 
bucket from May 03-09, which contained about 36 million events. 

5.2.2 Feature Construction 

We now describe the user/article features constructed for our
experiments. Two sets of features for the disjoint and hybrid models,
respectively, were used to test the two forms of LinUCB in Section 3
and to verify our conjecture that hybrid models can improve learning
speed.

We start with raw user features that were selected by "support".
The support of a feature is the fraction of users having that feature.
To reduce noise in the data, we only selected features with high
support; specifically, we used a feature when its support was at
least 0.1. Each user was then initially represented by a raw feature
vector of over 1000 categorical components, which include: (i)
demographic information: gender (2 classes) and age discretized into
10 segments; (ii) geographic features: about 200 metropolitan
locations worldwide and U.S. states; and (iii) behavioral categories:
about 1000 binary categories that summarize the user's consumption
history within Yahoo! properties. Other than these features, no
other information was used to identify a user.

* Footnote to Section 5.2.1: We call this view-based randomization.
After refreshing her browser, the user may not fall into the random
bucket again.

Similarly, each article was represented by a raw feature vector of 
about 100 categorical features constructed in the same way. These 
features include: (i) URL categories: tens of classes inferred from 
the URL of the article resource; and (ii) editor categories: tens of 
topics tagged by human editors to summarize the article content. 

We followed a previous procedure [12] to encode categorical
user/article features as binary vectors and then normalize each
feature vector to unit length. We also augmented each feature vector
with a constant feature of value 1. Now each article and user was
represented by a feature vector of 83 and 1193 entries, respectively.
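The encoding step just described can be sketched as follows. This is only an illustration under stated assumptions: the function name `encode_features`, the string feature names and the `vocabulary` mapping are hypothetical, and the constant feature is appended after normalization, which the text does not specify.

```python
import math
from typing import Dict, List


def encode_features(active: List[str], vocabulary: Dict[str, int]) -> List[float]:
    """Binary-encode categorical features, scale to unit length, append a constant 1."""
    vec = [0.0] * len(vocabulary)
    for name in active:
        if name in vocabulary:  # features outside the vocabulary are ignored
            vec[vocabulary[name]] = 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    if norm > 0:
        vec = [v / norm for v in vec]
    # Assumption: the constant feature is added after normalization.
    return vec + [1.0]
```

For example, a user with two active categories out of a three-entry vocabulary is mapped to a four-entry vector whose first three entries have unit Euclidean norm and whose last entry is 1.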

To further reduce dimensionality and capture nonlinearity in these
raw features, we carried out conjoint analysis based on random
exploration data collected in September 2008. Following a previous
approach to dimensionality reduction [13], we projected user features
onto article categories and then clustered users with similar
preferences into groups. More specifically:

• We first used logistic regression (LR) to fit a bilinear model
for click probability given raw user/article features, so that
$\phi_u^\top W \phi_a$ approximated the probability that user $u$
clicks on article $a$, where $\phi_u$ and $\phi_a$ were the
corresponding feature vectors, and $W$ was a weight matrix
optimized by LR.

• Raw user features were then projected onto an induced space by
computing $\psi_u \overset{\text{def}}{=} \phi_u^\top W$. Here, the
$i$-th component in $\psi_u$ for user $u$ may be interpreted as the
degree to which the user likes the $i$-th category of articles.
K-means was applied to group users in the induced $\psi_u$ space
into 5 clusters.

• The final user feature was a six-vector: five entries corresponded
to membership of that user in these 5 clusters (computed with a
Gaussian kernel and then normalized so that they sum up to unity),
and the sixth was a constant feature 1.

At trial $t$, each article $a$ has a separate six-dimensional feature
$x_{t,a}$ that is exactly the six-dimensional feature constructed as
above for user $u_t$. Since these article features do not overlap,
they are for disjoint linear models defined in Section 3.
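The projection and soft-clustering steps that produce the six-dimensional user feature can be sketched as below. The sketch assumes the weight matrix and the five cluster centers have already been fitted; the function names, the `bandwidth` parameter of the Gaussian kernel and the uniform fallback are illustrative assumptions, since the text does not give those details.

```python
import math
from typing import List, Sequence


def project(phi_u: Sequence[float], W: Sequence[Sequence[float]]) -> List[float]:
    """psi_u = phi_u^T W: one score per article category."""
    n_cols = len(W[0])
    return [sum(phi_u[i] * W[i][j] for i in range(len(phi_u))) for j in range(n_cols)]


def soft_membership(psi_u: Sequence[float],
                    centers: Sequence[Sequence[float]],
                    bandwidth: float = 1.0) -> List[float]:
    """Gaussian-kernel memberships, normalized to sum to 1, plus a constant 1."""
    weights = []
    for c in centers:
        sq_dist = sum((p - q) ** 2 for p, q in zip(psi_u, c))
        weights.append(math.exp(-sq_dist / (2.0 * bandwidth ** 2)))
    total = sum(weights)
    if total == 0.0:
        # Assumption: if every kernel underflows, fall back to uniform membership.
        weights, total = [1.0] * len(centers), float(len(centers))
    return [w / total for w in weights] + [1.0]
```

With five centers the result has six entries: five memberships that sum to one, followed by the constant feature.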

For each article $a$, we performed the same dimensionality reduction
to obtain a six-dimensional article feature (including a constant
1 feature). Its outer product with a user feature gave 6 x 6 = 36
features, denoted $z_{t,a} \in \mathbb{R}^{36}$, that corresponded to
the shared features of the hybrid model in Section 3, and thus
$(z_{t,a}, x_{t,a})$ could be used in the hybrid linear model. Note
that the features $z_{t,a}$ contain user-article interaction
information, while $x_{t,a}$ contains user information only.
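As a small illustration of the outer-product construction, the sketch below flattens the product of a six-dimensional user vector and a six-dimensional article vector into 36 shared features. The function name and the row-major ordering are assumptions made for this example.

```python
from typing import List, Sequence


def hybrid_shared_features(user_vec: Sequence[float],
                           article_vec: Sequence[float]) -> List[float]:
    """Flattened (row-major) outer product of a user and an article feature vector."""
    return [u * a for u in user_vec for a in article_vec]
```

Because both vectors end with the constant 1, the 36 entries include the user features themselves, the article features themselves, and every pairwise user-article product.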

Here, we intentionally used five user (and article) groups, which
have been shown to be representative in segmentation analysis [13].
Another reason for using a relatively small feature space is that,
in online services, storing and retrieving large amounts of
user/article information would be too expensive to be practical.

5.3 Compared Algorithms

The algorithms empirically evaluated in our experiments can be 
categorized into three groups: 

I. Algorithms that make no use of features. These correspond to
the context-free $K$-armed bandit algorithms that ignore all contexts
(i.e., user/article information).

• random: A random policy always chooses one of the candidate
articles from the pool with equal probability. This algorithm
requires no parameters and does not "learn" over time.

• ε-greedy: As described in Section 2.2, it estimates each
article's CTR; then it chooses a random article with probability
$\epsilon$, and chooses the article of the highest CTR estimate with
probability $1 - \epsilon$. The only parameter of this policy is
$\epsilon$.

• ucb: As described in Section 2.2, this policy estimates each
article's CTR as well as a confidence interval of the estimate,
and always chooses the article with the highest UCB. Specifically,
following UCB1 [7], we computed an article $a$'s confidence
interval by $c_{t,a} = \alpha / \sqrt{n_{t,a}}$, where $n_{t,a}$ is
the number of times $a$ was chosen prior to trial $t$, and
$\alpha > 0$ is a parameter.

• omniscient: Such a policy achieves the best empirical
context-free CTR from hindsight. It first computes each article's
empirical CTR from logged events, and then always chooses the
article with the highest empirical CTR when it is evaluated using
the same logged events. This algorithm requires no parameters and
does not "learn" over time.

Figure 2: Parameter tuning: CTRs of various algorithms on the one-day tuning dataset.
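The two learning rules in this group can be sketched with a single bookkeeping class. This is a minimal sketch, not the code used in the experiments: the class and method names are hypothetical, the confidence width is assumed to take the UCB1-style form alpha / sqrt(n), and giving unseen articles an infinite bound is a choice made here to avoid dividing by zero.

```python
import math
import random
from typing import List


class ContextFreeBandit:
    """Per-article click statistics with epsilon-greedy and UCB selection."""

    def __init__(self, n_articles: int) -> None:
        self.clicks: List[float] = [0.0] * n_articles
        self.pulls: List[int] = [0] * n_articles

    def ctr(self, a: int) -> float:
        """Empirical CTR of article a (0 if it was never shown)."""
        return self.clicks[a] / self.pulls[a] if self.pulls[a] else 0.0

    def select_epsilon_greedy(self, epsilon: float, rng: random.Random) -> int:
        if rng.random() < epsilon:
            return rng.randrange(len(self.pulls))      # explore uniformly
        return max(range(len(self.pulls)), key=self.ctr)  # exploit best estimate

    def select_ucb(self, alpha: float) -> int:
        def upper_bound(a: int) -> float:
            if self.pulls[a] == 0:
                return float("inf")  # assumption: try unseen articles first
            return self.ctr(a) + alpha / math.sqrt(self.pulls[a])
        return max(range(len(self.pulls)), key=upper_bound)

    def update(self, a: int, reward: float) -> None:
        self.pulls[a] += 1
        self.clicks[a] += reward
```

Setting `epsilon` to 0 makes the first rule purely greedy, while a larger `alpha` widens every confidence interval and therefore increases exploration in the second rule.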

II. Algorithms with "warm start", an intermediate step towards
personalized services. The idea is to provide an offline-estimated
user-specific adjustment on articles' context-free CTRs over the
whole traffic. The offset serves as an initialization on the CTR
estimate for new content, a.k.a. "warm start". We re-trained the
bilinear logistic regression model studied in [12] on September 2008
random traffic data, using the features $z_{t,a}$ constructed above.
The selection criterion then becomes the sum of the context-free CTR
estimate and a bilinear term for a user-specific CTR adjustment. In
training, CTR was estimated using the context-free ε-greedy with
$\epsilon = 1$.

• ε-greedy (warm): This algorithm is the same as ε-greedy
except it adds the user-specific CTR correction to the article's
context-free CTR estimate.

• ucb (warm): This algorithm is the same as the previous one
but replaces ε-greedy with ucb.

III. Algorithms that learn user-specific CTRs online.

• ε-greedy (seg): Each user is assigned to the closest user
cluster among the five constructed in Section 5.2.2, and so all
users are partitioned into five groups (a.k.a. user segments),
in each of which a separate copy of ε-greedy was run.

• ucb (seg): This algorithm is similar to ε-greedy (seg) except
it ran a copy of ucb in each of the five user segments.

• ε-greedy (disjoint): This is ε-greedy with disjoint models,
and may be viewed as a close variant of epoch-greedy [18].

• linucb (disjoint): This is Algorithm 1 with disjoint models.

• ε-greedy (hybrid): This is ε-greedy with hybrid models,
and may be viewed as a close variant of epoch-greedy.

• linucb (hybrid): This is Algorithm 2 with hybrid models.
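Algorithm 1 is not reproduced in this section, so the following is only a hedged sketch of a disjoint linear UCB rule in its usual ridge-regression form: each article keeps its own statistics, scores a context by the estimated payoff plus alpha times a confidence width, and is updated only when it is shown. The class name, the `choose_arm` helper and the use of a Sherman-Morrison update to maintain the inverse matrix are choices made here for illustration.

```python
import math
from typing import List, Sequence


class DisjointLinUCBArm:
    """Ridge-regression state for one article: A = I + sum x x^T, b = sum r x."""

    def __init__(self, d: int) -> None:
        # Store A^{-1} directly; it starts as the identity matrix.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
        self.b = [0.0] * d

    def _A_inv_times(self, v: Sequence[float]) -> List[float]:
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.A_inv]

    def upper_bound(self, x: Sequence[float], alpha: float) -> float:
        theta = self._A_inv_times(self.b)          # ridge estimate of the weights
        A_inv_x = self._A_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = alpha * math.sqrt(sum(xi * ai for xi, ai in zip(x, A_inv_x)))
        return mean + width

    def update(self, x: Sequence[float], reward: float) -> None:
        # Sherman-Morrison: (A + x x^T)^-1 = A^-1 - (A^-1 x)(A^-1 x)^T / (1 + x^T A^-1 x)
        A_inv_x = self._A_inv_times(x)
        denom = 1.0 + sum(xi * ai for xi, ai in zip(x, A_inv_x))
        d = len(x)
        for i in range(d):
            for j in range(d):
                self.A_inv[i][j] -= A_inv_x[i] * A_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def choose_arm(arms: Sequence[DisjointLinUCBArm], x: Sequence[float], alpha: float) -> int:
    """Pick the article with the highest upper confidence bound for context x."""
    return max(range(len(arms)), key=lambda a: arms[a].upper_bound(x, alpha))
```

In the disjoint setting of Section 5.2.2 the same six-dimensional user vector would be passed as `x` for every article, which is why `choose_arm` takes a single context.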

5.4 Performance Metric 

An algorithm's CTR is defined as the ratio of the number of clicks
it receives to the number of steps it is run. We used all
algorithms' CTRs on the random logged events for performance
comparison. To protect business-sensitive information, we report
an algorithm's relative CTR, which is the algorithm's CTR divided
by the random policy's. Therefore, we will not report a random
policy's relative CTR as it is always 1 by definition. For convenience,
we will use the term "CTR" from now on instead of "relative CTR".
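In symbols, writing $C$ for the number of clicks an algorithm receives over $N$ steps (both symbols are introduced here only for illustration), the two quantities defined above are:

$$\mathrm{CTR} = \frac{C}{N}, \qquad \text{relative CTR} = \frac{\mathrm{CTR}_{\text{algorithm}}}{\mathrm{CTR}_{\text{random}}},$$

so the random policy has relative CTR 1 by construction.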

For each algorithm, we are interested in two CTRs motivated by our
application, which may be useful for other similar applications.
When deploying the methods to Yahoo!'s front page, one reasonable
way is to randomly split all traffic to this page into two
buckets [3]. The first, called the "learning bucket", usually
consists of a small fraction of traffic on which various bandit
algorithms are run to learn/estimate article CTRs. The other, called
the "deployment bucket", is where Yahoo! Front Page greedily serves
users using CTR estimates obtained from the learning bucket. Note
that "learning" and "deployment" are interleaved in this problem, and
so in every view falling into the deployment bucket, the article with
the highest current (user-specific) CTR estimate is chosen; this
estimate may change later if the learning bucket gets more data. CTRs
in both buckets were estimated with Algorithm 3.

Table 1: Performance evaluation: CTRs of all algorithms on the one-week evaluation dataset in the deployment and learning buckets
(denoted by "deploy" and "learn" in the table, respectively). Numbers given as percentages are the CTR lift compared to ε-greedy.



Since the deployment bucket is often larger than the learning 
bucket, CTR in the deployment bucket is more important. How- 
ever, a higher CTR in the learning bucket suggests a faster learning 
rate (or equivalently, smaller regret) for a bandit algorithm. There- 
fore, we chose to report algorithm CTRs in both buckets. 

5.5 Experimental Results 

5.5.1 Results for Tuning Data 

Each of the competing algorithms (except random and omniscient) in
Section 5.3 requires a single parameter: $\epsilon$ for ε-greedy
algorithms and $\alpha$ for UCB ones. We used tuning data to optimize
these parameters. Figure 2 shows how the CTR of each algorithm
changes with respective parameters. All results were obtained by
a single run, but given the size of our dataset and the unbiasedness
result in Theorem 1, the reported numbers are statistically reliable.

First, as seen from Figure 2, the CTR curves in the learning bucket
often have an inverted U-shape. When the parameter ($\epsilon$ or
$\alpha$) was too small, there was insufficient exploration, and the
algorithms failed to identify good articles and received fewer
clicks. On the other hand, when the parameter was too large, the
algorithms appeared to over-explore and thus wasted some of the
opportunities to increase the number of clicks. Based on these plots
on tuning data, we chose appropriate parameters for each algorithm
and ran it once on the evaluation data in the next subsection.

Second, it can be concluded from the plots that warm-start
information is indeed helpful for finding a better match between user
interest and article content, compared to the no-feature versions of
ε-greedy and UCB. Specifically, both ε-greedy (warm) and ucb (warm)
were able to beat omniscient, the highest CTR achievable by
context-free policies in hindsight. However, performance of the two
algorithms using warm-start information is not as stable as that of
algorithms that learn the weights online. Since the offline model
for "warm start" was trained with article CTRs estimated on all
random traffic [12], ε-greedy (warm) gets more stable performance
in the deployment bucket when $\epsilon$ is close to 1. The warm
start part also helps ucb (warm) in the learning bucket by selecting
more attractive articles for users from scratch, but did not help
ucb (warm) in determining the best article online for deployment.
Since ucb relies on a confidence interval for exploration, it is hard
to correct the initialization bias introduced by "warm start". In
contrast, all online-learning algorithms were able to consistently
beat the omniscient policy. Therefore, we did not try the warm-start
algorithms on the evaluation data.



Third, ε-greedy algorithms (on the left of Figure 2) achieved similar
CTR as upper confidence bound ones (on the right of Figure 2) in the
deployment bucket when appropriate parameters were used. Thus, both
types of algorithms appeared to learn comparable policies. However,
the ε-greedy algorithms seemed to have lower CTR in the learning
bucket, which is consistent with the empirical findings of
context-free algorithms [2] in real bucket tests.

Finally, to compare algorithms when data are sparse, we repeated
the same parameter tuning process for each algorithm with fewer
data, at the level of 30%, 20%, 10%, 5%, and 1%. Note that we
still used all data to evaluate an algorithm's CTR as done in
Algorithm 3, but then only a fraction of available data were randomly
chosen to be used by the algorithm to improve its policy.
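One way to read this sparse-data protocol is sketched below: every event on which the policy agrees with the logged arm counts towards the CTR, but only a random fraction of those events is passed to the learner. The function name, the `select` and `learn` callbacks, and the decision to subsample matched events (rather than all logged events) are assumptions made for illustration.

```python
import random
from typing import Callable, Iterable, Tuple


def evaluate_with_subsampled_learning(
    events: Iterable[Tuple[int, float]],
    select: Callable[[], int],
    learn: Callable[[int, float], None],
    fraction: float,
    rng: random.Random,
) -> float:
    """events holds (logged_arm, reward) pairs; returns the CTR over matched events."""
    clicks = 0.0
    matched = 0
    for logged_arm, reward in events:
        if select() != logged_arm:
            continue  # rejected, as in the offline evaluator
        matched += 1
        clicks += reward  # every matched event counts towards the CTR
        if rng.random() < fraction:
            learn(logged_arm, reward)  # only a random fraction updates the policy
    return clicks / matched if matched else 0.0
```

With `fraction=0.1`, for instance, the CTR is still measured on all matched events while the learner sees roughly one in ten of them.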

5.5.2 Results for Evaluation Data 

With parameters optimized on the tuning data (cf. Figure 2), we
ran the algorithms on the evaluation data and summarized the CTRs
in Table 1. The table also reports the CTR lift compared to the
baseline of ε-greedy. The CTR of omniscient was 1.615, and so
a significantly larger CTR of an algorithm indicates its effective use
of user/article features for personalization. Recall that the reported
CTRs were normalized by the random policy's CTR. We examine
the results more closely in the following subsections.

On the Use of Features. 

We first investigate whether it helps to use features in article
recommendation. It is clear from Table 1 that, by considering user
features, both ε-greedy (seg/disjoint/hybrid) and UCB methods
(ucb (seg) and linucb (disjoint/hybrid)) were able to achieve a
CTR lift of around 10%, compared to the baseline ε-greedy.

To better visualize the effect of features, Figure 3 shows how an
article's CTR (when chosen by an algorithm) was lifted compared
to its base CTR (namely, the context-free CTR).* Here, an article's
base CTR measures how interesting it is to a random user, and was
estimated from logged events. Therefore, a high ratio of the lifted
and base CTRs of an article is a strong indicator that an algorithm
does recommend this article to potentially interested users.
Figure 3(a) shows that neither ε-greedy nor ucb was able to lift
article CTRs, since they made no use of user information. In
contrast, all the other three plots show clear benefits of
personalized recommendation. In an extreme case (Figure 3(c)), one
article's CTR was lifted from 1.31 to 3.03, a 132% improvement.

* To avoid inaccurate CTR estimates, only the 50 articles that were
chosen most often by an algorithm were included in its own plots.
Hence, the plots for different algorithms are not comparable.

Figure 3: Scatterplots of the base CTR vs. lifted CTR (in the learning bucket) of the 50 most frequently selected articles when 100%
of the evaluation data were used. Red crosses are for ε-greedy algorithms, and blue circles are for UCB algorithms. Note that the sets of
most frequently chosen articles varied with algorithms; see the text for details.

Furthermore, it is consistent with our previous results on tuning
data that, compared to ε-greedy algorithms, UCB methods achieved
higher CTRs in the deployment bucket, and the advantage was even
greater in the learning bucket. As mentioned in Section 2.2,
ε-greedy approaches are unguided because they choose articles
uniformly at random for exploration. In contrast, exploration in
upper confidence bound methods is effectively guided by confidence
intervals, a measure of uncertainty in an algorithm's CTR estimate.
Our experimental results imply the effectiveness of upper confidence
bound methods, and we believe they have similar benefits in many
other applications as well.

On the Size of Data. 

One of the challenges in personalized web services is the scale
of the applications. In our problem, for example, a small pool of
news articles was hand-picked by human editors. But if we wish
to allow more choices or use automated article selection methods
to determine the article pool, the number of articles can be too large
even for the high volume of Yahoo! traffic. Therefore, it becomes
critical for an algorithm to quickly identify a good match between
user interests and article contents when data are sparse. In our
experiments, we artificially reduced data size (to the levels of 30%,
20%, 10%, 5%, and 1%, respectively) to mimic the situation where
we have a large article pool but a fixed volume of traffic.

To better visualize the comparison results, we use bar graphs in
Figure 4 to plot all algorithms' CTRs with various data sparsity
levels. A few observations are in order. First, at all data sparsity
levels, features were still useful. At the level of 1%, for instance,
we observed a 10.3% improvement of linucb (hybrid)'s CTR in the
deployment bucket (1.493) over ucb's (1.354).
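As a quick check of the arithmetic, the lift quoted for the 1% level follows from the two deployment-bucket CTRs:

$$\frac{1.493}{1.354} - 1 \approx 0.103,$$

that is, roughly a 10.3% improvement.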

Second, UCB methods consistently outperformed ε-greedy ones in the
deployment bucket.* The advantage over ε-greedy was even more
apparent when data size was smaller.

Third, compared to ucb (seg) and linucb (disjoint), linucb (hybrid)
showed significant benefits when data size was small. Recall that in
hybrid models, some features are shared by all articles, making it
possible for CTR information of one article to be "transferred" to
others. This advantage is particularly useful when the article pool
is large. In contrast, in disjoint models, feedback of one article
may not be utilized by other articles; the same is true for
ucb (seg). Figure 4(a) shows transfer learning is indeed helpful
when data are sparse.

* In the less important learning bucket, there were two exceptions
for linucb (disjoint).

Comparing ucb (seg) and linucb (disjoint). 

From Figure 4(a), it can be seen that ucb (seg) and linucb (disjoint)
had similar performance. We believe it was no coincidence.
Recall that features in our disjoint model are actually normalized
membership measures of a user in the five clusters described in
Section 5.2.2. Hence, these features may be viewed as a "soft"
version of the user assignment process adopted by ucb (seg).

Figure 5 plots the histogram of a user's relative membership
measure to the closest cluster, namely, the largest component of the
user's five non-constant features. It is clear that most users were
quite close to one of the five cluster centers: the maximum
membership of about 85% of users was higher than 0.5, and that of
about 40% of them was higher than 0.8. Therefore, many of these
features have a highly dominating component, making the feature
vector similar to the "hard" version of user group assignment.
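The relationship between the soft membership vector and a hard segment assignment can be made concrete with a short sketch. The function names are hypothetical, and the L1 gap is simply one convenient way, chosen here, to quantify how close a soft vector is to its one-hot counterpart.

```python
from typing import List, Sequence


def hard_assignment(membership: Sequence[float]) -> int:
    """Index of the closest cluster: the largest soft membership."""
    return max(range(len(membership)), key=lambda k: membership[k])


def one_hot(index: int, size: int) -> List[float]:
    return [1.0 if k == index else 0.0 for k in range(size)]


def l1_gap(membership: Sequence[float]) -> float:
    """L1 distance between a soft membership vector and its hard (one-hot) version."""
    hard = one_hot(hard_assignment(membership), len(membership))
    return sum(abs(m - h) for m, h in zip(membership, hard))
```

For memberships that sum to one, the gap equals twice one minus the largest membership, so a user whose largest membership is 0.85 is only 0.3 away from a hard assignment, whereas a perfectly uniform user is 1.6 away.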

We believe that adding more features with diverse components,
such as those found by principal component analysis, would be
necessary to further distinguish linucb (disjoint) from ucb (seg).



6. CONCLUSIONS 

This paper takes a contextual-bandit approach to personalized
web-based services such as news article recommendation. We proposed
a simple and reliable method for evaluating bandit algorithms
directly from logged events, so that the often problematic
simulator-building step could be avoided. Based on real Yahoo!
Front Page traffic, we found that upper confidence bound methods
generally outperform the simpler yet unguided ε-greedy methods.
Furthermore, our new algorithm LinUCB shows advantages when
data are sparse, suggesting its effectiveness for personalized web
services when the number of contents in the pool is large.

In the future, we plan to investigate bandit approaches to other
similar web-based services such as online advertising, and compare
our algorithms to related methods such as Banditron [16]. A
second direction is to extend the bandit formulation and algorithms
to settings in which an "arm" may refer to a complex object rather
than an item (like an article). An example is ranking, where an arm
corresponds to a permutation of retrieved webpages. Finally, user
interests change over time, and so it is interesting to consider
temporal information in bandit algorithms.

(a) CTRs in the deployment bucket.
(b) CTRs in the learning bucket.
Figure 4: CTRs in evaluation data with varying data sizes.